{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kxu1Gfhx1pHg"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/jeffheaton/app_generative_ai/blob/main/t81_559_class_13_3_speech2text.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZbNbAV281pHh"
   },
   "source": [
    "# T81-559: Applications of Generative Artificial Intelligence\n",
    "**Module 13: Speech Processing**\n",
    "* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)\n",
    "* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HwnvSYEQ1pHi"
   },
   "source": [
    "# Module 13 Material\n",
    "\n",
    "Module 13: Speech Processing\n",
    "\n",
    "* Part 13.1: Intro to Speech Processing [[Video]](https://www.youtube.com/watch?v=ILNcv9zrMyQ&ab_channel=JeffHeaton) [[Notebook]](t81_559_class_13_1_speech_models.ipynb)\n",
    "* Part 13.2: Text to Speech [[Video]](https://www.youtube.com/watch?v=O5_oaK5fHqI&ab_channel=JeffHeaton) [[Notebook]](t81_559_class_13_2_text2speech.ipynb)\n",
    "* **Part 13.3: Speech to Text** [[Video]](https://www.youtube.com/watch?v=zor64w90fpQ&ab_channel=JeffHeaton) [[Notebook]](t81_559_class_13_3_speech2text.ipynb)\n",
    "* Part 13.4: Speech Bot [[Video]](https://www.youtube.com/watch?v=8fgxX6yLorI&ab_channel=JeffHeaton) [[Notebook]](t81_559_class_13_4_speechbot.ipynb)\n",
    "* Part 13.5: Future Directions in GenAI [[Video]](https://www.youtube.com/watch?v=T4AYP_XXTbg&ab_channel=JeffHeaton) [[Notebook]](t81_559_class_13_5_future.ipynb)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "uKwaz0NVNzA9"
   },
   "source": [
    "# Google CoLab Instructions\n",
    "\n",
    "The following code ensures that Google CoLab is running and maps Google Drive if needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "UoioFHmgNzxC"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "try:\n",
    "    from google.colab import drive, userdata\n",
    "    COLAB = True\n",
    "    print(\"Note: using Google CoLab\")\n",
    "except:\n",
    "    print(\"Note: not using Google CoLab\")\n",
    "    COLAB = False\n",
    "\n",
    "# OpenAI Secrets\n",
    "if COLAB:\n",
    "    os.environ[\"OPENAI_API_KEY\"] = userdata.get('OPENAI_API_KEY')\n",
    "\n",
    "# Install needed libraries in CoLab\n",
    "if COLAB:\n",
    "    !sudo apt-get install portaudio19-dev python3-pyaudio\n",
    "    !pip install langchain langchain_openai openai pyaudio sounddevice numpy pydub"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "v2MPPX0c1pHi"
   },
   "source": [
    "# Part 13.3: Speech to Text\n",
    "\n",
    "In this module, we'll delve into the realm of speech-to-text technology, focusing on the powerful capabilities offered by OpenAI's models. Speech-to-text, also known as automatic speech recognition (ASR), is a technology that converts spoken language into written text. OpenAI's speech-to-text models represent the cutting edge of this field, leveraging advanced machine learning techniques to achieve high accuracy and robustness across various accents, languages, and acoustic environments. We'll explore how these models can be integrated into applications to enable voice-based interactions, transcription services, and accessibility features. By harnessing OpenAI's speech-to-text technology, we'll unlock new possibilities for human-computer interaction and demonstrate how to transform audio input into actionable text data with remarkable precision.\n",
    "\n",
    "\n",
    "Note we will make use of the technique described here to record audio in CoLab.\n",
    "\n",
    "https://gist.github.com/korakot/c21c3476c024ad6d56d5f48b0bca92be\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "w6ow8HFUNUs_"
   },
   "outputs": [],
   "source": [
    "from IPython.display import Javascript\n",
    "from google.colab import output\n",
    "from base64 import b64decode\n",
    "import io\n",
    "from IPython.display import Audio\n",
    "\n",
    "from pydub import AudioSegment\n",
    "\n",
    "RECORD = \"\"\"\n",
    "const sleep = time => new Promise(resolve => setTimeout(resolve, time))\n",
    "const b2text = blob => new Promise(resolve => {\n",
    "    const reader = new FileReader()\n",
    "    reader.onloadend = e => resolve(e.srcElement.result)\n",
    "    reader.readAsDataURL(blob)\n",
    "})\n",
    "var record = time => new Promise(async resolve => {\n",
    "    stream = await navigator.mediaDevices.getUserMedia({ audio: true })\n",
    "    recorder = new MediaRecorder(stream)\n",
    "    chunks = []\n",
    "    recorder.ondataavailable = e => chunks.push(e.data)\n",
    "    recorder.start()\n",
    "    await sleep(time)\n",
    "    recorder.onstop = async ()=>{\n",
    "        blob = new Blob(chunks)\n",
    "        text = await b2text(blob)\n",
    "        resolve(text)\n",
    "    }\n",
    "    recorder.stop()\n",
    "})\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WGoIgiyQKGcK"
   },
   "source": [
    "The following code uses JavaScript to record audio for a specified amount of time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "N1EA4kFKOH-E"
   },
   "outputs": [],
   "source": [
    "def record(seconds=3):\n",
    "    print(f\"Recording now for {seconds} seconds.\")\n",
    "    display(Javascript(RECORD))\n",
    "    s = output.eval_js('record(%d)' % (seconds * 1000))\n",
    "    binary = b64decode(s.split(',')[1])\n",
    "\n",
    "    # Convert to AudioSegment\n",
    "    audio = AudioSegment.from_file(io.BytesIO(binary), format=\"webm\")\n",
    "\n",
    "    # Export as WAV\n",
    "    audio.export(\"recorded_audio.wav\", format=\"wav\")\n",
    "    print(\"Recording done.\")\n",
    "    return audio"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PBwAnTyBKLxP"
   },
   "source": [
    "The following code provides a quick test of this technique."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "EeO-I6RIOIzA"
   },
   "outputs": [],
   "source": [
    "# Record 5 seconds of audio\n",
    "audio = record(5)\n",
    "\n",
    "print(\"Recording complete. Audio saved as 'recorded_audio.wav'\")\n",
    "\n",
    "# Play back the recorded audio\n",
    "display(Audio(\"recorded_audio.wav\",autoplay=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jurTAwUCKRM-"
   },
   "source": [
    "This code snippet demonstrates how to use OpenAI's speech-to-text API to transcribe audio files. It defines a function transcribe_audio that takes a filename as input. The function opens the specified audio file in binary mode and uses the OpenAI client to create a transcription. The client.audio.transcriptions.create() method is called with two parameters: the model (\"whisper-1\") and the audio file. Whisper is OpenAI's state-of-the-art speech recognition model, known for its robustness across various languages and accents. The function returns the transcribed text. In the example usage, an audio file named \"recorded_audio.wav\" is transcribed, and the resulting text is printed. This code provides a simple yet powerful way to convert speech to text, which can be invaluable for tasks such as generating subtitles, creating searchable archives of audio content, or enabling voice commands in applications."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "qrf898AGONJ6"
   },
   "outputs": [],
   "source": [
    "from openai import OpenAI\n",
    "\n",
    "client = OpenAI()\n",
    "\n",
    "def transcribe_audio(filename):\n",
    "    with open(filename, \"rb\") as audio_file:\n",
    "        transcription = client.audio.transcriptions.create(\n",
    "            model=\"whisper-1\",\n",
    "            file=audio_file\n",
    "        )\n",
    "    return transcription.text\n",
    "\n",
    "# Transcribe the recorded audio\n",
    "transcription = transcribe_audio(\"recorded_audio.wav\")\n",
    "print(\"Transcription:\")\n",
    "print(transcription)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "V_yZeTPmApe0"
   },
   "source": [
    "### Voice Change\n",
    "This code snippet demonstrates a complete workflow for speech-to-text and text-to-speech conversion using OpenAI's APIs. The generate_text function uses OpenAI's TTS-1 model with the \"nova\" voice to convert text into speech, returning the raw audio data. The speak_text function builds upon this by taking a text input, generating the corresponding audio, and then playing it using IPython's Audio display function with autoplay enabled. The main workflow begins by recording audio for 5 seconds (using a record function not shown in the snippet), transcribing it using the previously defined transcribe_audio function, printing the transcription, and finally speaking the transcribed text back using the speak_text function. This creates a full circle of voice interaction: recording speech, converting it to text, and then converting that text back into speech, effectively demonstrating both speech recognition and speech synthesis capabilities in a single, cohesive process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Ulv7qUwiAvAN"
   },
   "outputs": [],
   "source": [
    "\n",
    "\n",
    "def generate_text(text):\n",
    "    response = client.audio.speech.create(\n",
    "        model=\"tts-1\",\n",
    "        voice=\"nova\",\n",
    "        input=text\n",
    "    )\n",
    "    audio_data = response.content\n",
    "    return audio_data  # Return the audio data directly\n",
    "\n",
    "def speak_text(text):\n",
    "    audio_data = generate_text(text)\n",
    "    display(Audio(audio_data, autoplay=True))\n",
    "\n",
    "# Transcribe the recorded audio\n",
    "audio = record(5)\n",
    "transcription = transcribe_audio(\"recorded_audio.wav\")\n",
    "print(\"Transcription:\")\n",
    "print(transcription)\n",
    "speak_text(transcription)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "yss8dyUzLgm1"
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
